feat: Dubai villa lead scraper + Apify bridge + Sheets sync #1

vishnu-madhavan-git wants to merge 1 commit into main from
Conversation
- agents/dubai_villa_scraper.py: stdlib-only scraper for PropertyFinder + Bayut
  - Extracts UAE phone numbers, owner names, areas, prices
  - Deduplicates by phone number
  - Outputs JSON to data/state/villa_leads.json
- agents/apify_dubai_scraper.py: Apify actor bridge (faster path)
  - Uses redoubtable_bubble/dubai-real-estate-scraper actor
  - Handles anti-bot automatically
  - Requires APIFY_TOKEN in .env
- core/leads-bridge.js: syncs villa_leads.json to Google Sheets CRM
  - Deduplicates synced leads
  - Uses existing SheetsService pattern

Use case: IXR interior design client acquisition from Dubai villa owners
📝 Walkthrough

A Dubai villa lead scraping system is added with multiple collection options: PropertyFinder/Bayut (no API key required) and Apify-based scraping. Includes deduplication logic and JSON persistence at data/state/villa_leads.json.

Changes
Sequence Diagrams

```mermaid
sequenceDiagram
participant DLS as Dubai Villa Scraper
participant PF as PropertyFinder/Bayut
participant Store as JSON Store
participant Dedup as Deduplicator
DLS->>PF: Fetch listings (area, pagination)
PF-->>DLS: HTML response
DLS->>DLS: Parse & extract leads<br/>(phone, price, URL, area)
DLS->>Store: Load existing leads
Store-->>DLS: villa_leads.json
DLS->>Dedup: Check for duplicates<br/>(by phone number)
Dedup-->>DLS: New leads, dup count
DLS->>Store: Append & save updated list
Store-->>DLS: Persisted
DLS->>DLS: Output summary JSON<br/>(new, total, duplicates)
```

```mermaid
sequenceDiagram
participant Bridge as Leads Bridge
participant VL as villa_leads.json
participant SL as synced_leads.json
participant Sheets as Google Sheets
Bridge->>VL: Read all villa leads
VL-->>Bridge: Lead array
Bridge->>SL: Load synced phones
SL-->>Bridge: Phone list
Bridge->>Bridge: Filter new leads<br/>(exclude synced)
Bridge->>Sheets: Initialize API
loop For each new lead
Bridge->>Sheets: Sync lead<br/>(name, area, price, URL)
Sheets-->>Bridge: Success/Failure
end
Bridge->>SL: Append synced phones
SL-->>Bridge: Updated
Bridge->>Bridge: Return sync summary
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
```js
}

async function syncLeadsToSheets() {
  const sheetsService = new (require("./sheets"))();
```
🔴 new (require('./sheets'))() fails because sheets.js exports a singleton instance, not a class
core/sheets.js:74 exports module.exports = new SheetsService() — a singleton instance, not the SheetsService class. In core/leads-bridge.js:30, new (require('./sheets'))() attempts to call new on that instance, which will throw TypeError: (intermediate value) is not a constructor at runtime. The function syncLeadsToSheets() can never succeed.
```diff
- const sheetsService = new (require("./sheets"))();
+ const sheetsService = require("./sheets");
```
```js
// Mark all as synced
const newSynced = [...loadSynced(), ...unsynced];
saveSynced(newSynced);
```
🔴 Failed-to-sync leads are incorrectly marked as synced
At core/leads-bridge.js:58, all unsynced leads are unconditionally added to the synced list, even if sheetsService.addLead() threw an error for some of them (caught at line 52). This means leads that failed to sync to Google Sheets will never be retried on subsequent runs, since they'll already be in synced_leads.json. The code should only mark successfully synced leads.
Prompt for agents
In core/leads-bridge.js, lines 43-59: Track which leads were successfully synced in an array (e.g., `successfullySynced`), push each lead to it inside the try block after the synced++ increment (around line 51), and then on line 58, use that array instead of `unsynced` when building `newSynced`. This ensures only leads that were actually synced to Google Sheets get marked in synced_leads.json.
```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
    log(f"Fetched {len(items)} items from Apify")
    return items
```
🔴 Apify dataset items API returns a JSON array, not a {data: {items: []}} object
The Apify API endpoint /v2/actor-runs/{runId}/dataset/items returns a raw JSON array of items, not an object with a data.items wrapper. At agents/apify_dubai_scraper.py:109-110, apify_request parses the response with json.loads() which yields a Python list. Then result.get("data", {}).get("items", []) will raise AttributeError: 'list' object has no attribute 'get' because lists don't have .get(). This means fetch_results() will always crash after a successful actor run.
```diff
 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
     result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
+    if isinstance(result, list):
+        items = result
+    else:
+        items = result.get("data", {}).get("items", [])
     log(f"Fetched {len(items)} items from Apify")
     return items
```
```python
encoded_area = urllib.parse.quote(area) if area else "dubai"
url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
```
🔴 Bayut URL construction double-encodes area: urllib.parse.quote() then .replace(' ', '-') is a no-op
At agents/dubai_villa_scraper.py:219-220, when area is provided, encoded_area = urllib.parse.quote(area) converts spaces to %20 (e.g., "Palm Jumeirah" → "Palm%20Jumeirah"). The subsequent .lower().replace(' ', '-') on line 220 is then a no-op since there are no literal spaces left, producing a URL like .../villa/palm%20jumeirah/ instead of the expected .../villa/palm-jumeirah/. This will likely return a 404 or wrong results from Bayut.
```diff
- encoded_area = urllib.parse.quote(area) if area else "dubai"
- url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
+ area_slug = area.lower().replace(" ", "-") if area else "dubai"
+ url = f"https://www.bayut.com/for-rent/villa/{area_slug}/?owner_only=1"
```
Pull request overview
Adds an automated lead-generation pipeline for Dubai villa listings, including two Python-based scrapers (stdlib + Apify) and a Node bridge that syncs collected leads into the existing Google Sheets CRM integration.
Changes:
- Add agents/dubai_villa_scraper.py to scrape PropertyFinder/Bayut and persist leads to data/state/villa_leads.json.
- Add agents/apify_dubai_scraper.py to pull owner-contact leads via an Apify actor and persist them in the same state file.
- Add core/leads-bridge.js plus README updates to sync stored leads into Google Sheets.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| core/leads-bridge.js | New bridge that loads leads from data/state and pushes unsynced ones to Google Sheets. |
| agents/requirements.txt | Updates comments describing scraper dependency expectations. |
| agents/dubai_villa_scraper.py | New stdlib-only scraper that collects and deduplicates leads into state JSON. |
| agents/apify_dubai_scraper.py | New Apify API bridge that runs an actor, normalizes results, and writes leads to state JSON. |
| README.md | Documents how to run both scrapers and how to sync leads to Sheets. |
```js
}

async function syncLeadsToSheets() {
  const sheetsService = new (require("./sheets"))();
```
core/sheets.js exports an already-instantiated SheetsService (module.exports = new SheetsService()), but this code treats it like a constructor (new (require("./sheets"))()). That will throw at runtime. Use the exported instance directly and call init()/addLead() on it.
```diff
- const sheetsService = new (require("./sheets"))();
+ const sheetsService = sheets;
```
```js
// Mark all as synced
const newSynced = [...loadSynced(), ...unsynced];
saveSynced(newSynced);
```
This marks all unsynced leads as synced even if some addLead calls fail (errors are caught but the lead is still appended to synced_leads.json). This will prevent retries and can permanently drop leads from being synced. Track only successfully synced leads (or update the synced file incrementally on each success) before writing synced_leads.json.
```python
encoded_area = urllib.parse.quote(area) if area else "dubai"
url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
```
Bayut URL construction mixes quote() with a slug-style path. For areas like "Palm Jumeirah", urllib.parse.quote(area) yields Palm%20Jumeirah, so the URL becomes /villa/palm%20jumeirah/… rather than the expected hyphenated slug. Build the slug first (lowercase + replace spaces with '-') and then quote (or avoid quoting if the slug is URL-safe) to ensure the listings page resolves correctly.
```diff
- encoded_area = urllib.parse.quote(area) if area else "dubai"
- url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"
+ if area:
+     # Build a Bayut-style slug: lowercase and hyphen-separated, then URL-encode if needed.
+     area_slug = area.strip().lower().replace(" ", "-")
+     encoded_area = urllib.parse.quote(area_slug, safe="-")
+ else:
+     encoded_area = "dubai"
+ url = f"https://www.bayut.com/for-rent/villa/{encoded_area}/?owner_only=1"
```
```js
const sheets = require("./sheets");
```
const sheets = require("./sheets"); is unused (and syncLeadsToSheets requires the module again later). After fixing the SheetsService instantiation, keep a single import and use it to avoid duplication and dead code.
```python
area_slug = area.lower().replace(" ", "-") if area else ""
base_url = "https://www.propertyfinder.ae/en/search?c=2&t=1&fu=1&rp=y"
if area_slug:
```
area_slug is computed but never used. Remove it or use it consistently when building the query/location parameter to keep the scraping logic clear.
```diff
- area_slug = area.lower().replace(" ", "-") if area else ""
  base_url = "https://www.propertyfinder.ae/en/search?c=2&t=1&fu=1&rp=y"
- if area_slug:
+ if area:
```
```python
# Simpler extraction - look for villa data in JSON-LD or meta tags
# PropertyFinder embeds listing data as JSON
json_matches = re.findall(r'window\.__INITIAL_STATE__\s*=\s*({.*?});\s*</script>', html, re.DOTALL)
if not json_matches:
    json_matches = re.findall(r'"properties"\s*:\s*(\[.*?\])', html, re.DOTALL)
```
json_matches is computed but never used. If the intent is to parse window.__INITIAL_STATE__ for structured listing data (more robust than regex), consider implementing that; otherwise remove this block to reduce dead code.
```diff
- # Simpler extraction - look for villa data in JSON-LD or meta tags
- # PropertyFinder embeds listing data as JSON
- json_matches = re.findall(r'window\.__INITIAL_STATE__\s*=\s*({.*?});\s*</script>', html, re.DOTALL)
- if not json_matches:
-     json_matches = re.findall(r'"properties"\s*:\s*(\[.*?\])', html, re.DOTALL)
```
```python
def fetch_url(url: str, headers: dict = None) -> str | None:
    """Fetch a URL with retry logic."""
    default_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
```
fetch_url uses the PEP 604 union type (str | None), which requires Python 3.10+. The repo README currently targets “system python” without a minimum version, and other agents avoid this syntax. Consider switching to Optional[str] (and similar) or explicitly documenting/enforcing Python >= 3.10 for agents.
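For reference, a minimal sketch of the same signatures rewritten with `typing.Optional` so they run on pre-3.10 interpreters (bodies elided; only the annotations change, and `list[dict]` would similarly become `typing.List[dict]` for anything before 3.9):

```python
from typing import Optional

def fetch_url(url: str, headers: Optional[dict] = None) -> Optional[str]:
    """Fetch a URL with retry logic (body unchanged)."""
    ...

def run_actor(token: str, area: str, max_items: int,
              property_type: str = "villa") -> Optional[str]:
    """Start the Apify actor run and return run ID (body unchanged)."""
    ...
```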
```python
    for p in phones:
        clean = re.sub(r'[\s\-]', '', p)
        if clean not in normalized:
            normalized.append(clean)
    return normalized
```
deduplicate() keys off lead["phone"], but normalization here only strips spaces/dashes. The same UAE number can appear as 00971..., +971..., or local 05... and won’t dedupe correctly. Normalize to a single canonical format (e.g. E.164 +971...) before adding to normalized.
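A minimal sketch of such a normalizer, assuming leads carry UAE numbers only in the three spellings named above (`normalize_uae_phone` is a hypothetical helper, not in the PR):

```python
import re

def normalize_uae_phone(raw: str) -> str:
    """Hypothetical helper: map 00971..., 971..., and local 05... spellings
    to one canonical E.164 form (+971...) before deduplication."""
    digits = re.sub(r"[^\d+]", "", raw)       # strip spaces, dashes, parens
    if digits.startswith("+971"):
        return digits
    if digits.startswith("00971"):
        return "+971" + digits[5:]
    if digits.startswith("971"):
        return "+" + digits
    if digits.startswith("0"):                # local format, e.g. 050 123 4567
        return "+971" + digits[1:]
    return digits                             # unrecognized: leave as-is
```

Deduplication on both the scraper and bridge sides would then key off `normalize_uae_phone(lead["phone"])` instead of the raw string.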
```python
def run_actor(token: str, area: str, max_items: int, property_type: str = "villa") -> str | None:
    """Start the Apify actor run and return run ID."""
```
This file also uses PEP 604 union types in return annotations (e.g. str | None, dict | None), which require Python 3.10+. If agents are intended to run on an unspecified “system python”, consider using Optional[...]/Union[...] or documenting/enforcing Python >= 3.10.
```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
```
fetch_results assumes apify_request returns an object with data.items, but the dataset items endpoint commonly returns a raw JSON array. If a list is returned, result.get(...) will throw and the scraper will crash. Adjust apify_request/fetch_results to handle a list response (or request a response shape that’s always an object).
```diff
-    items = result.get("data", {}).get("items", [])
+    # The Apify dataset items endpoint may return either:
+    # - a raw JSON array of items, or
+    # - an object that wraps items under data.items.
+    if isinstance(result, list):
+        items = result
+    elif isinstance(result, dict):
+        data = result.get("data", result)
+        if isinstance(data, dict):
+            items = data.get("items", [])
+        elif isinstance(data, list):
+            items = data
+        else:
+            items = []
+    else:
+        items = []
```
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agents/apify_dubai_scraper.py`:
- Around line 7-12: The script documents that APIFY_TOKEN can live in a .env but
only reads os.environ; fix by loading .env at module start (before any env
access) using python-dotenv: add "from dotenv import load_dotenv" and call
"load_dotenv()" near the top of the file before retrieving APIFY_TOKEN (and
similarly before the env checks around the block referenced at lines ~175-180).
Alternatively, update the CLI docs to remove the .env claim — but the preferred
fix is to call load_dotenv() before accessing APIFY_TOKEN so os.environ sees
variables from .env.
- Around line 148-158: Concurrent runs overwrite the shared JSON leads store
because load_existing_leads and save_leads read the whole file, mutate
in-memory, and rewrite it; change these functions to perform concurrency-safe
updates (e.g., acquire a file lock around read-modify-write or switch to an
append-only/JSONL writer) so overlapping scrapers don't lose data. Specifically,
update load_existing_leads, save_leads and any callers that append to LEADS_FILE
so they obtain an exclusive lock on LEADS_FILE (or open it in append mode for
JSONL) before reading/writing, merge new leads into the existing set safely, and
use atomic replace/rename when writing to avoid partial writes; ensure the same
locking/format is used by agents/dubai_villa_scraper.py to keep behavior
consistent.
- Around line 107-111: The fetch_results function assumes apify_request returned
a dict and calls .get on it, but the /actor-runs/{run_id}/dataset/items endpoint
returns a raw list; update fetch_results to handle both shapes by checking the
type of the response from apify_request (called in fetch_results) and set items
= result if it's a list, otherwise fall back to result.get("data",
{}).get("items", []); keep the log(f"Fetched {len(items)} items from Apify") and
ensure the function returns the items list.
In `@agents/dubai_villa_scraper.py`:
- Around line 49-59: The read-modify-write on LEADS_FILE in
load_existing_leads/save_leads causes lost-update races; wrap the entire
sequence that reads, merges new leads, and writes back with a file-level lock
(e.g., using a FileLock/portalocker) so concurrent agents serialize access to
LEADS_FILE, perform a read->merge-by-unique-id->write under the same lock, and
write atomically (write to temp + os.replace) to avoid corruption; apply the
same locking/merge pattern to the other lead-write logic referenced around the
second block (lines 302-319) and reference the functions load_existing_leads and
save_leads when locating the code to change.
- Around line 219-237: The scraper only fetches the first Bayut search page so
it under-collects when max_results is larger; modify the listing collection
around encoded_area/url/fetch_url/listing_urls to paginate (add a page
parameter, e.g. append "&page={page}" to the existing query), loop fetching
pages starting at page=1, extract and accumulate unique listing URLs into
listing_urls until you have >= max_results or no new URLs are returned, and then
slice to max_results before iterating to create leads; ensure you break on empty
pages and update logs to show pages fetched.
In `@core/leads-bridge.js`:
- Around line 43-59: The code currently appends all unsynced leads to persistent
storage regardless of whether sheetsService.addLead succeeded; change the logic
to only persist leads that were actually synced by collecting successful leads
(e.g., push to a local array inside the try block or incrementally build
syncedLeads) and then call saveSynced([...loadSynced(), ...syncedLeads]) instead
of saveSynced([...loadSynced(), ...unsynced]); keep the existing try/catch
around sheetsService.addLead and error logging for failed attempts so transient
failures remain retryable.
- Around line 29-31: The syncLeadsToSheets function instantiates the Sheets
module with new (require("./sheets"))() which is incorrect because
core/sheets.js exports a singleton; change the code to require/import the
exported sheets instance (the exported symbol from "./sheets") and use that
instance (e.g., sheets.init()) instead of creating a new object; if you actually
need the class, update core/sheets.js to also export the class constructor and
require that specific export, but by default reuse the exported sheets singleton
rather than calling new on require("./sheets").
In `@README.md`:
- Around line 57-70: Update the README to explicitly require Python 3.10+ for
running the scraper commands: add a short note above the examples stating
"Requires Python 3.10+" (or "Python 3.10 or later") because the agents
(agents/dubai_villa_scraper.py and agents/apify_dubai_scraper.py) use PEP 604
union syntax (e.g., str | None) and newer parameterized built-ins; also keep the
APIFY_TOKEN note for the Apify scraper so users know environment setup
requirements.
📒 Files selected for processing (5)
- README.md
- agents/apify_dubai_scraper.py
- agents/dubai_villa_scraper.py
- agents/requirements.txt
- core/leads-bridge.js
```python
This is the FAST path - uses Apify's actor which handles anti-bot measures.
Requires APIFY_TOKEN in .env

Usage:
    python agents/apify_dubai_scraper.py
    python agents/apify_dubai_scraper.py --area "Palm Jumeirah" --max 100
```
Load .env here or change the documented CLI contract.
The docstring says APIFY_TOKEN can live in .env, but this entry point only checks os.environ. Running python agents/apify_dubai_scraper.py as documented will still hit APIFY_TOKEN not set unless the caller exported the variable first.
Also applies to: 175-180
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/apify_dubai_scraper.py` around lines 7 - 12, The script documents that
APIFY_TOKEN can live in a .env but only reads os.environ; fix by loading .env at
module start (before any env access) using python-dotenv: add "from dotenv
import load_dotenv" and call "load_dotenv()" near the top of the file before
retrieving APIFY_TOKEN (and similarly before the env checks around the block
referenced at lines ~175-180). Alternatively, update the CLI docs to remove the
.env claim — but the preferred fix is to call load_dotenv() before accessing
APIFY_TOKEN so os.environ sees variables from .env.
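If adding the python-dotenv dependency is undesirable (the scrapers are otherwise stdlib-only), a rough stdlib fallback could look like this (a sketch; it ignores quoting edge cases and `export` prefixes, and `load_env_file` is a hypothetical name):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines only, comments and blanks skipped.
    Existing environment variables win over file values."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))

# Call before any env access:
load_env_file()
token = os.environ.get("APIFY_TOKEN")
```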
```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    items = result.get("data", {}).get("items", [])
    log(f"Fetched {len(items)} items from Apify")
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, find and inspect the apify_dubai_scraper.py file
find . -name "apify_dubai_scraper.py" -type f
```
Repository: vishnu-madhavan-git/automation
🌐 Web query:
Apify API dataset-items endpoint response format /actor-runs/{runId}/dataset/items
💡 Result:
GET /v2/actor-runs/{runId}/dataset/items is the Actor-run "default dataset" items endpoint and it behaves the same as the regular Get dataset items endpoint (same query params + same output formats). (docs.apify.com)

Default response (no format param)
- format=json (default): returns a JSON array of item objects, e.g. [{...},{...}]. (docs.apify.com)

Other response formats (format=...)
- format=jsonl: newline-delimited JSON (one JSON object per line, ending with \n). (docs.apify.com)
- format=csv: CSV text (header\nrow\n...), with optional UTF-8 BOM behavior controlled by bom=0|1. (docs.apify.com)
- Also supported: html, xlsx, xml, rss (returned as the corresponding textual/binary representation). (docs.apify.com)

Pagination metadata (via response headers)
The items response includes these headers:
- X-Apify-Pagination-Offset
- X-Apify-Pagination-Limit
- X-Apify-Pagination-Count
- X-Apify-Pagination-Total
(docs.apify.com)

Common query params that affect the output shape
- offset, limit
- fields (pick only listed fields), omit
- unwind, flatten
- clean (shortcut for skipHidden=true + skipEmpty=true) (docs.apify.com)
Citations:
- 1: https://docs.apify.com/api/v2/actor-run-get
- 2: https://docs.apify.com/api/v2/dataset-items-get
- 3: https://docs.apify.com/api/v2/dataset-items-get
- 4: https://docs.apify.com/api/v2/dataset-items-get
- 5: https://docs.apify.com/api/v2/dataset-items-get
- 6: https://docs.apify.com/api/v2/dataset-items-get
- 7: https://docs.apify.com/api/v2/dataset-items-get
🏁 Script executed:
```bash
# Search for the apify_request function to understand response handling
rg "def apify_request" -A 10 -t py
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Inspect the actual fetch_results function in the file to confirm code and line numbers
cat -n agents/apify_dubai_scraper.py | sed -n '100,120p'
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Also check the full context around the function to understand usage
cat -n agents/apify_dubai_scraper.py | sed -n '1,30p'
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check the complete apify_request function to see full implementation
cat -n agents/apify_dubai_scraper.py | sed -n '40,60p'
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Verify if there are any other usages of apify_request to understand the response pattern
rg "apify_request" agents/apify_dubai_scraper.py -B 1 -A 1
```
Repository: vishnu-madhavan-git/automation
Treat the dataset-items response as a raw list.
Apify's /actor-runs/{runId}/dataset/items endpoint returns a JSON array directly (e.g. [{...},{...}]), not a wrapped object. The current code calls .get("data", {}) on the response, which will fail with AttributeError: 'list' object has no attribute 'get'. This differs from other Apify endpoints in this file (e.g. /actor-runs/{run_id}) which return wrapped responses.
Proposed fix
```diff
 def fetch_results(token: str, run_id: str) -> list[dict]:
     """Fetch results from completed actor run."""
-    result = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
-    items = result.get("data", {}).get("items", [])
+    items = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
+    if not isinstance(items, list):
+        log(f"Unexpected dataset response: {items}")
+        return []
     log(f"Fetched {len(items)} items from Apify")
     return items
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def fetch_results(token: str, run_id: str) -> list[dict]:
    """Fetch results from completed actor run."""
    items = apify_request("GET", f"/actor-runs/{run_id}/dataset/items", token)
    if not isinstance(items, list):
        log(f"Unexpected dataset response: {items}")
        return []
    log(f"Fetched {len(items)} items from Apify")
    return items
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/apify_dubai_scraper.py` around lines 107 - 111, The fetch_results
function assumes apify_request returned a dict and calls .get on it, but the
/actor-runs/{run_id}/dataset/items endpoint returns a raw list; update
fetch_results to handle both shapes by checking the type of the response from
apify_request (called in fetch_results) and set items = result if it's a list,
otherwise fall back to result.get("data", {}).get("items", []); keep the
log(f"Fetched {len(items)} items from Apify") and ensure the function returns
the items list.
```python
def load_existing_leads() -> list:
    if LEADS_FILE.exists():
        try:
            return json.loads(LEADS_FILE.read_text(encoding="utf-8"))
        except Exception:
            return []
    return []


def save_leads(leads: list) -> None:
    LEADS_FILE.write_text(json.dumps(leads, indent=2, ensure_ascii=False), encoding="utf-8")
```
Serialize writes to the shared leads store.
This code loads data/state/villa_leads.json, appends in memory, and rewrites the whole file. agents/dubai_villa_scraper.py does the same against the same path, so overlapping runs will drop whichever scraper saves first.
Also applies to: 198-201
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 152-152: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/apify_dubai_scraper.py` around lines 148 - 158, Concurrent runs
overwrite the shared JSON leads store because load_existing_leads and save_leads
read the whole file, mutate in-memory, and rewrite it; change these functions to
perform concurrency-safe updates (e.g., acquire a file lock around
read-modify-write or switch to an append-only/JSONL writer) so overlapping
scrapers don't lose data. Specifically, update load_existing_leads, save_leads
and any callers that append to LEADS_FILE so they obtain an exclusive lock on
LEADS_FILE (or open it in append mode for JSONL) before reading/writing, merge
new leads into the existing set safely, and use atomic replace/rename when
writing to avoid partial writes; ensure the same locking/format is used by
agents/dubai_villa_scraper.py to keep behavior consistent.
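One possible shape for the locked read-merge-write, sketched with stdlib `fcntl` (POSIX-only; `portalocker` or `filelock` would be the cross-platform route). The `phone` dedup key and the `data/state/villa_leads.json` path come from the PR description; `append_leads_atomically` is a hypothetical name:

```python
import fcntl
import json
import os
import tempfile
from pathlib import Path

LEADS_FILE = Path("data/state/villa_leads.json")

def append_leads_atomically(new_leads: list) -> None:
    """Serialize concurrent agents with an exclusive lock, then write the
    merged list to a temp file and atomically replace the store."""
    lock_path = LEADS_FILE.with_suffix(".lock")
    LEADS_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)      # held until the file closes
        try:
            existing = json.loads(LEADS_FILE.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            existing = []
        seen = {lead.get("phone") for lead in existing}
        existing += [l for l in new_leads if l.get("phone") not in seen]
        fd, tmp = tempfile.mkstemp(dir=LEADS_FILE.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(json.dumps(existing, indent=2, ensure_ascii=False))
        os.replace(tmp, LEADS_FILE)           # atomic rename on POSIX
```

Writing the temp file into the same directory keeps `os.replace()` on one filesystem, which is what makes the final rename atomic.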
```python
def load_existing_leads() -> list:
    if LEADS_FILE.exists():
        try:
            return json.loads(LEADS_FILE.read_text(encoding="utf-8"))
        except Exception:
            return []
    return []


def save_leads(leads: list) -> None:
    LEADS_FILE.write_text(json.dumps(leads, indent=2, ensure_ascii=False), encoding="utf-8")
```
The shared leads store still has a lost-update race.
Like the Apify agent, this flow rewrites data/state/villa_leads.json via a read-modify-write sequence with no locking. If both agents overlap, the later write drops the earlier agent’s newly found leads.
Also applies to: 302-319
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 53-53: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/dubai_villa_scraper.py` around lines 49 - 59, The read-modify-write on
LEADS_FILE in load_existing_leads/save_leads causes lost-update races; wrap the
entire sequence that reads, merges new leads, and writes back with a file-level
lock (e.g., using a FileLock/portalocker) so concurrent agents serialize access
to LEADS_FILE, perform a read->merge-by-unique-id->write under the same lock,
and write atomically (write to temp + os.replace) to avoid corruption; apply the
same locking/merge pattern to the other lead-write logic referenced around the
second block (lines 302-319) and reference the functions load_existing_leads and
save_leads when locating the code to change.
```python
encoded_area = urllib.parse.quote(area) if area else "dubai"
url = f"https://www.bayut.com/for-rent/villa/{encoded_area.lower().replace(' ', '-')}/?owner_only=1"

log(f"Bayut: starting scrape (area={area or 'Dubai'}, max={max_results})")
html = fetch_url(url)

if not html:
    log("Bayut: failed to fetch listings page")
    return leads

# Extract listing links
listing_urls = re.findall(r'"(https://www\.bayut\.com/property/[^"]+)"', html)
if not listing_urls:
    listing_urls = re.findall(r'href="(/property/[^"]+)"', html)
    listing_urls = [f"https://www.bayut.com{u}" for u in listing_urls]

log(f"Bayut: found {len(listing_urls)} listing URLs")

for listing_url in listing_urls[:max_results]:
```
Paginate Bayut results before honoring --max.
This implementation requests exactly one Bayut search page and never advances a page parameter. Any run that needs more leads than that first page exposes will silently under-collect, even though --max suggests otherwise.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@agents/dubai_villa_scraper.py` around lines 219 - 237, The scraper only
fetches the first Bayut search page so it under-collects when max_results is
larger; modify the listing collection around
encoded_area/url/fetch_url/listing_urls to paginate (add a page parameter, e.g.
append "&page={page}" to the existing query), loop fetching pages starting at
page=1, extract and accumulate unique listing URLs into listing_urls until you
have >= max_results or no new URLs are returned, and then slice to max_results
before iterating to create leads; ensure you break on empty pages and update
logs to show pages fetched.
```js
async function syncLeadsToSheets() {
  const sheetsService = new (require("./sheets"))();
  const initialized = await sheetsService.init();
```
Use the exported Sheets singleton instead of constructing it.
core/sheets.js already exports a service instance. Calling new (require("./sheets"))() here will throw before any sync happens; reuse the imported sheets object or export the class separately.
Proposed fix
```diff
 async function syncLeadsToSheets() {
-  const sheetsService = new (require("./sheets"))();
+  const sheetsService = sheets;
   const initialized = await sheetsService.init();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/leads-bridge.js` around lines 29 - 31, The syncLeadsToSheets function
instantiates the Sheets module with new (require("./sheets"))() which is
incorrect because core/sheets.js exports a singleton; change the code to
require/import the exported sheets instance (the exported symbol from
"./sheets") and use that instance (e.g., sheets.init()) instead of creating a
new object; if you actually need the class, update core/sheets.js to also export
the class constructor and require that specific export, but by default reuse the
exported sheets singleton rather than calling new on require("./sheets").
```js
let synced = 0;
for (const lead of unsynced) {
  try {
    await sheetsService.addLead({
      name: lead.name,
      phone: lead.phone,
      notes: `Area: ${lead.area} | Price: ${lead.price} | Source: ${lead.source} | ${lead.url}`
    });
    synced++;
  } catch (err) {
    console.error(`[leads-bridge] Failed to sync ${lead.phone}:`, err.message);
  }
}

// Mark all as synced
const newSynced = [...loadSynced(), ...unsynced];
saveSynced(newSynced);
```
Only persist leads that actually synced.
saveSynced() currently appends every unsynced lead, even when addLead() threw. A transient Sheets failure will therefore permanently suppress retries for that lead.
Proposed fix
```diff
 let synced = 0;
+const syncedNow = [];
 for (const lead of unsynced) {
   try {
     await sheetsService.addLead({
       name: lead.name,
       phone: lead.phone,
       notes: `Area: ${lead.area} | Price: ${lead.price} | Source: ${lead.source} | ${lead.url}`
     });
     synced++;
+    syncedNow.push(lead);
   } catch (err) {
     console.error(`[leads-bridge] Failed to sync ${lead.phone}:`, err.message);
   }
 }
 // Mark all as synced
-const newSynced = [...loadSynced(), ...unsynced];
+const newSynced = [...loadSynced(), ...syncedNow];
 saveSynced(newSynced);
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```js
let synced = 0;
const syncedNow = [];
for (const lead of unsynced) {
  try {
    await sheetsService.addLead({
      name: lead.name,
      phone: lead.phone,
      notes: `Area: ${lead.area} | Price: ${lead.price} | Source: ${lead.source} | ${lead.url}`
    });
    synced++;
    syncedNow.push(lead);
  } catch (err) {
    console.error(`[leads-bridge] Failed to sync ${lead.phone}:`, err.message);
  }
}

// Mark all as synced
const newSynced = [...loadSynced(), ...syncedNow];
saveSynced(newSynced);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/leads-bridge.js` around lines 43 - 59, The code currently appends all
unsynced leads to persistent storage regardless of whether sheetsService.addLead
succeeded; change the logic to only persist leads that were actually synced by
collecting successful leads (e.g., push to a local array inside the try block or
incrementally build syncedLeads) and then call saveSynced([...loadSynced(),
...syncedLeads]) instead of saveSynced([...loadSynced(), ...unsynced]); keep the
existing try/catch around sheetsService.addLead and error logging for failed
attempts so transient failures remain retryable.
````markdown
## Dubai Villa Lead Scraper

Two scraper agents for collecting direct villa owner contacts in Dubai:

### Basic Scraper (no API key needed)
```bash
python agents/dubai_villa_scraper.py --source both --area "Palm Jumeirah" --max 30
```

### Apify Scraper (faster, more reliable)
Requires `APIFY_TOKEN` in `.env`
```bash
python agents/apify_dubai_scraper.py --area "Emirates Hills" --max 100
```
````
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's find the agents directory and list its contents
find . -name "agents" -type d | head -20
git ls-files | grep -E "agents/" | head -20
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check if the agent files exist
ls -la agents/ 2>/dev/null || echo "agents/ directory not found at root"
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Search for these specific files
find . -name "dubai_villa_scraper.py" -o -name "apify_dubai_scraper.py"
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check for version requirements in common config files
cat pyproject.toml 2>/dev/null | grep -i python || echo "No python requirement in pyproject.toml"
cat setup.py 2>/dev/null | grep -i python || echo "No setup.py found"
cat setup.cfg 2>/dev/null | grep -i python || echo "No setup.cfg found"
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check the agent files for parameterized built-ins and union syntax
wc -l agents/dubai_villa_scraper.py agents/apify_dubai_scraper.py
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Search for parameterized built-ins (list[...], dict[...], etc.) and union syntax (X | Y)
rg "list\[|dict\[|tuple\[|set\[|\s\|\s" agents/
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check the first 50 lines of each agent file for imports and type hints
head -50 agents/dubai_villa_scraper.py
```
Repository: vishnu-madhavan-git/automation

🏁 Script executed:
```bash
# Check the apify agent file as well
head -50 agents/apify_dubai_scraper.py
```
Repository: vishnu-madhavan-git/automation
Specify Python 3.10+ requirement for these commands.
The agents use PEP 604 union syntax (str | None) and parameterized built-ins (list[str], dict, etc.), which require Python 3.10+ and 3.9+ respectively. The README shows unqualified python commands, so users with system Python < 3.10 will fail at runtime. Add a note requiring Python 3.10+.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 57 - 70, Update the README to explicitly require
Python 3.10+ for running the scraper commands: add a short note above the
examples stating "Requires Python 3.10+" (or "Python 3.10 or later") because the
agents (agents/dubai_villa_scraper.py and agents/apify_dubai_scraper.py) use PEP
604 union syntax (e.g., str | None) and newer parameterized built-ins; also keep
the APIFY_TOKEN note for the Apify scraper so users know environment setup
requirements.
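Besides the README note, the agents themselves could fail fast on old interpreters with a guard like this (a hypothetical addition, not in the PR):

```python
import sys

# PEP 604 unions (str | None) in annotations raise TypeError on Python < 3.10
# when the def statements execute, so this check must run first: place it at
# the very top of each agent, before any annotated function definitions.
if sys.version_info < (3, 10):
    sys.exit("This scraper requires Python 3.10 or later.")
```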
Summary
Adds automated Dubai villa owner lead collection for IXR interior design client acquisition.
New files
- agents/dubai_villa_scraper.py: stdlib-only scraper (PropertyFinder + Bayut)
- agents/apify_dubai_scraper.py: Apify-powered scraper (faster, handles anti-bot)
- core/leads-bridge.js: syncs leads to Google Sheets CRM

Usage
Sheets sync
Summary by CodeRabbit
New Features
Documentation